Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service means a loss of revenue for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave their credit card services, and understand the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
# To load and manipulate data
import pandas as pd
import numpy as np
# To visualize data
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# Import evaluation metrics for classification
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix )
%matplotlib inline
# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
# Import SMOTE package for oversampling
from imblearn.over_sampling import SMOTE
# Import Random undersampler for undersampling
from imblearn.under_sampling import RandomUnderSampler
# Import package to split data into train and test sets
from sklearn.model_selection import train_test_split
# Import package for KNNImputer
from sklearn.impute import KNNImputer
# Import package for Randomizedsearchcv
from sklearn.model_selection import RandomizedSearchCV
# Import MinMax scaler package
from sklearn.preprocessing import MinMaxScaler
# Import package for GridSearchCV
from sklearn.model_selection import GridSearchCV
# Import packages for classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
#To install xgboost
#!pip install xgboost
from xgboost import XGBClassifier
RS = 1
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Read the .csv file into a dataframe
orig_df = pd.read_csv('/content/drive/MyDrive/python/Advanced Machine Learning/Projects/BankChurners.csv')
# creating a copy of the data
bankchurn_df = orig_df.copy()
# Print the head of data set
bankchurn_df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
# Print the tail of the data set
bankchurn_df.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | ... | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | ... | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | ... | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
5 rows × 21 columns
# Print random rows of the data set
bankchurn_df.sample(5,random_state=1)
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6498 | 712389108 | Existing Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Blue | 36 | ... | 3 | 2 | 2570.0 | 2107 | 463.0 | 0.651 | 4058 | 83 | 0.766 | 0.820 |
| 9013 | 718388733 | Existing Customer | 38 | F | 1 | College | NaN | Less than $40K | Blue | 32 | ... | 3 | 3 | 2609.0 | 1259 | 1350.0 | 0.871 | 8677 | 96 | 0.627 | 0.483 |
| 2053 | 710109633 | Existing Customer | 39 | M | 2 | College | Married | $60K - $80K | Blue | 31 | ... | 3 | 2 | 9871.0 | 1061 | 8810.0 | 0.545 | 1683 | 34 | 0.478 | 0.107 |
| 3211 | 717331758 | Existing Customer | 44 | M | 4 | Graduate | Married | $120K + | Blue | 32 | ... | 3 | 4 | 34516.0 | 2517 | 31999.0 | 0.765 | 4228 | 83 | 0.596 | 0.073 |
| 5559 | 709460883 | Attrited Customer | 38 | F | 2 | Doctorate | Married | Less than $40K | Blue | 28 | ... | 2 | 4 | 1614.0 | 0 | 1614.0 | 0.609 | 2437 | 46 | 0.438 | 0.000 |
5 rows × 21 columns
# Print the number of rows and columns of the data set
print("Number of Rows: ", bankchurn_df.shape[0])
print("Number of Columns: ", bankchurn_df.shape[1])
Number of Rows: 10127 Number of Columns: 21
# Print the dataset info
bankchurn_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
# Print the statistical summary including categorical variables
bankchurn_df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | NaN | NaN | NaN | 739177606.333663 | 36903783.450231 | 708082083.0 | 713036770.5 | 717926358.0 | 773143533.0 | 828343083.0 |
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.0 | NaN | NaN | NaN | 46.32596 | 8.016814 | 26.0 | 41.0 | 46.0 | 52.0 | 73.0 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.0 | NaN | NaN | NaN | 2.346203 | 1.298908 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.0 | NaN | NaN | NaN | 35.928409 | 7.986416 | 13.0 | 31.0 | 36.0 | 40.0 | 56.0 |
| Total_Relationship_Count | 10127.0 | NaN | NaN | NaN | 3.81258 | 1.554408 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| Months_Inactive_12_mon | 10127.0 | NaN | NaN | NaN | 2.341167 | 1.010622 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Contacts_Count_12_mon | 10127.0 | NaN | NaN | NaN | 2.455317 | 1.106225 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Credit_Limit | 10127.0 | NaN | NaN | NaN | 8631.953698 | 9088.77665 | 1438.3 | 2555.0 | 4549.0 | 11067.5 | 34516.0 |
| Total_Revolving_Bal | 10127.0 | NaN | NaN | NaN | 1162.814061 | 814.987335 | 0.0 | 359.0 | 1276.0 | 1784.0 | 2517.0 |
| Avg_Open_To_Buy | 10127.0 | NaN | NaN | NaN | 7469.139637 | 9090.685324 | 3.0 | 1324.5 | 3474.0 | 9859.0 | 34516.0 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | NaN | NaN | NaN | 4404.086304 | 3397.129254 | 510.0 | 2155.5 | 3899.0 | 4741.0 | 18484.0 |
| Total_Trans_Ct | 10127.0 | NaN | NaN | NaN | 64.858695 | 23.47257 | 10.0 | 45.0 | 67.0 | 81.0 | 139.0 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | NaN | NaN | NaN | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations
# Check if data set has any missing values
bankchurn_df.isnull().sum()
| 0 | |
|---|---|
| CLIENTNUM | 0 |
| Attrition_Flag | 0 |
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 1519 |
| Marital_Status | 749 |
| Income_Category | 0 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
# Check if dataset has duplicate rows
bankchurn_df.duplicated().sum()
0
# Drop the ID column from the data set as it is not required in further EDA and Data Modeling
bankchurn_df.drop('CLIENTNUM', axis=1, inplace=True)
Questions:
1. How does the change in transaction count (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
2. How does the number of inactive months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# Create the Histogram and boxplot for Age of customers
histogram_boxplot(bankchurn_df, 'Customer_Age')
Observations
# Check count per Gender type in Gender column
print(f"Gender Count = {bankchurn_df['Gender'].value_counts()}");
# Percentage of Genders among customers
labeled_barplot(bankchurn_df, 'Gender', perc=True)
Gender Count = Gender F 5358 M 4769 Name: count, dtype: int64
Observations
The total number of female customers is 5,358, which constitutes approximately 52.9% of the customer base.
The total number of male customers is 4,769, which constitutes approximately 47.1% of the customer base.
# Check all unique dependent counts in the Dependent_count column
print(f"Total number of dependent type = {bankchurn_df['Dependent_count'].nunique()} with distinct dependent sizes {bankchurn_df['Dependent_count'].unique()}")
# Percentage of total dependents per dependent type
labeled_barplot(bankchurn_df, 'Dependent_count', perc=True)
Total number of dependent type = 6 with distinct dependent sizes [3 5 4 2 0 1]
Observations
# Check all unique Education Levels in the Education_Level column
print(f"Total number of Education Levels = {bankchurn_df['Education_Level'].nunique()} with distinct Education Levels {bankchurn_df['Education_Level'].unique()}")
# Percentage of total Education per Education Levels
labeled_barplot(bankchurn_df, 'Education_Level', perc=True)
Total number of Education Levels = 6 with distinct Education Levels ['High School' 'Graduate' 'Uneducated' nan 'College' 'Post-Graduate' 'Doctorate']
Observations
# Check all unique Marital Status values in the Marital_Status column
print(f"Total number of Marital Status = {bankchurn_df['Marital_Status'].nunique()} with distinct Marital Status {bankchurn_df['Marital_Status'].unique()}")
# Percentage of Marital Status among Marital Status Types
labeled_barplot(bankchurn_df, 'Marital_Status', perc=True)
Total number of Marital Status = 3 with distinct Marital Status ['Married' 'Single' nan 'Divorced']
Observations
# Check all Annual Income Categories of the account holder
print(f"Total number of Income Category = {bankchurn_df['Income_Category'].nunique()} with distinct Income Categories {bankchurn_df['Income_Category'].unique()}")
# Percentage of Annual Income Category of the account holder per Income Category
labeled_barplot(bankchurn_df, 'Income_Category', perc=True)
Total number of Income Category = 6 with distinct Income Categories ['$60K - $80K' 'Less than $40K' '$80K - $120K' '$40K - $60K' '$120K +' 'abc']
Observations
# Check all Card Categories of the account holder
print(f"Total number of Card Category = {bankchurn_df['Card_Category'].nunique()} with distinct Card Categories {bankchurn_df['Card_Category'].unique()}")
# Percentage of Card_Category of the account holder per Card Category
labeled_barplot(bankchurn_df, 'Card_Category', perc=True)
Total number of Card Category = 4 with distinct Card Categories ['Blue' 'Gold' 'Silver' 'Platinum']
Observations
# Create the Histogram and boxplot for Months_on_book of customers
histogram_boxplot(bankchurn_df, 'Months_on_book')
Observations
# Check all distinct numbers of products held by the account holder
print(f"Total number of products = {bankchurn_df['Total_Relationship_Count'].nunique()} with distinct number of products {bankchurn_df['Total_Relationship_Count'].unique()}")
# Percentage of products of the account holder held per number of products
labeled_barplot(bankchurn_df, 'Total_Relationship_Count', perc=True)
Total number of products = 6 with distinct number of products [5 6 4 3 2 1]
Observations
# Check the distinct numbers of months customers were inactive in the last 12 months
print(f"Total number of Inactive months = {bankchurn_df['Months_Inactive_12_mon'].nunique()} with distinct Inactive months {bankchurn_df['Months_Inactive_12_mon'].unique()}")
# Percentage of customers Inactive per Inactive months
labeled_barplot(bankchurn_df, 'Months_Inactive_12_mon', perc=True)
Total number of Inactive months = 7 with distinct Inactive months [1 4 2 3 6 0 5]
Observations
# Check all unique counts of contacts between the customer and the bank in the last 12 months
print(f"Total number of contact counts = {bankchurn_df['Contacts_Count_12_mon'].nunique()} with distinct contact counts {bankchurn_df['Contacts_Count_12_mon'].unique()}")
# Percentage of customers who contacted the bank, per contact count
labeled_barplot(bankchurn_df, 'Contacts_Count_12_mon', perc=True)
Total number of contact counts = 7 with distinct contact counts [3 2 0 1 4 5 6]
Observations
# Create the Histogram and boxplot for Credit_Limit of customers
histogram_boxplot(bankchurn_df, 'Credit_Limit')
Observations
# Create the Histogram and boxplot for Total_Revolving_Bal of customers
histogram_boxplot(bankchurn_df, 'Total_Revolving_Bal')
Observations
# Create the Histogram and boxplot for Avg_Open_To_Buy of customers
histogram_boxplot(bankchurn_df, 'Avg_Open_To_Buy')
Observations
# Create the Histogram and boxplot for Total_Trans_Amt of customers
histogram_boxplot(bankchurn_df, 'Total_Trans_Amt')
Observations
# Create the Histogram and boxplot for Total_Trans_Ct of customers
histogram_boxplot(bankchurn_df, 'Total_Trans_Ct')
Observations
# Create the Histogram and boxplot for Total_Ct_Chng_Q4_Q1 of customers
histogram_boxplot(bankchurn_df, 'Total_Ct_Chng_Q4_Q1')
Observations
# Create the Histogram and boxplot for Total_Amt_Chng_Q4_Q1 of customers
histogram_boxplot(bankchurn_df, 'Total_Amt_Chng_Q4_Q1')
Observations
# Create the Histogram and boxplot for Avg_Utilization_Ratio of customers
histogram_boxplot(bankchurn_df, 'Avg_Utilization_Ratio')
Observations
# Relationship between Attrition Flag and Age
distribution_plot_wrt_target(bankchurn_df,'Customer_Age','Attrition_Flag')
# Relationship between Attrition Flag and Gender
stacked_barplot(bankchurn_df,'Gender','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Dependent Count
stacked_barplot(bankchurn_df,'Dependent_count','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Dependent_count All 1627 8500 10127 3 482 2250 2732 2 417 2238 2655 1 269 1569 1838 4 260 1314 1574 0 135 769 904 5 64 360 424 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Education Level
stacked_barplot(bankchurn_df,'Education_Level','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Marital Status
stacked_barplot(bankchurn_df,'Marital_Status','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Marital_Status All 1498 7880 9378 Married 709 3978 4687 Single 668 3275 3943 Divorced 121 627 748 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Income Category
stacked_barplot(bankchurn_df,'Income_Category','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1627 8500 10127 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 abc 187 925 1112 $120K + 126 601 727 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Card Category
stacked_barplot(bankchurn_df,'Card_Category','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Months on book
distribution_plot_wrt_target(bankchurn_df,'Months_on_book','Attrition_Flag')
# Relationship between Attrition Flag and Total Relationship Count
stacked_barplot(bankchurn_df,'Total_Relationship_Count','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Total_Relationship_Count All 1627 8500 10127 3 400 1905 2305 2 346 897 1243 1 233 677 910 5 227 1664 1891 4 225 1687 1912 6 196 1670 1866 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Months Inactive 12 month
stacked_barplot(bankchurn_df,'Months_Inactive_12_mon','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Months_Inactive_12_mon All 1627 8500 10127 3 826 3020 3846 2 505 2777 3282 4 130 305 435 1 100 2133 2233 5 32 146 178 6 19 105 124 0 15 14 29 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Contacts Count 12 month
stacked_barplot(bankchurn_df,'Contacts_Count_12_mon','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Contacts_Count_12_mon All 1627 8500 10127 3 681 2699 3380 2 403 2824 3227 4 315 1077 1392 1 108 1391 1499 5 59 117 176 6 54 0 54 0 7 392 399 ------------------------------------------------------------------------------------------------------------------------
# Relationship between Attrition Flag and Customers Credit_Limit
distribution_plot_wrt_target(bankchurn_df,'Credit_Limit','Attrition_Flag')
# Relationship between Attrition Flag and Total Revolving Balance
distribution_plot_wrt_target(bankchurn_df,'Total_Revolving_Bal','Attrition_Flag')
# Relationship between Attrition Flag and amount left on the credit card to use
distribution_plot_wrt_target(bankchurn_df,'Avg_Open_To_Buy','Attrition_Flag')
# Relationship between Attrition Flag and Total Trans Amount
distribution_plot_wrt_target(bankchurn_df,'Total_Trans_Amt','Attrition_Flag')
# Relationship between Attrition Flag and Total_Amt_Chng_Q4_Q1
distribution_plot_wrt_target(bankchurn_df,'Total_Amt_Chng_Q4_Q1','Attrition_Flag')
# Relationship between Attrition Flag and Total Trans Count
distribution_plot_wrt_target(bankchurn_df,'Total_Trans_Ct','Attrition_Flag')
# Relationship between Attrition Flag and Total_Ct_Chng_Q4_Q1
distribution_plot_wrt_target(bankchurn_df,'Total_Ct_Chng_Q4_Q1','Attrition_Flag')
# Relationship between Attrition Flag and Average Utilization Ratio
distribution_plot_wrt_target(bankchurn_df,'Avg_Utilization_Ratio','Attrition_Flag')
# Replace the Attrition_Flag column value to 0 and 1
bankchurn_df['Attrition_Flag'] = bankchurn_df['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})
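Worth noting when encoding with pd.Series.map: any label missing from the mapping dict silently becomes NaN, so the keys must match the column values exactly. A small illustration on toy data ("Unknown" is a made-up label):

```python
import pandas as pd

# A toy status column with one label that the mapping dict does not cover
s = pd.Series(["Existing Customer", "Attrited Customer", "Unknown"])
mapped = s.map({"Existing Customer": 0, "Attrited Customer": 1})

# "Unknown" has no key in the dict, so it is mapped to NaN
print(mapped.tolist())  # [0.0, 1.0, nan]
```

This is why the label dictionaries used below must spell category values exactly as they appear in the data.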
# Correlation between all numeric columns up to 3 decimals
plt.figure(figsize =(12,12))
sns.heatmap(data = bankchurn_df.select_dtypes(include=np.number).corr(),annot = True,cmap='YlGnBu',fmt=".3f",vmin=-1,vmax=1)
plt.show()
Observations
Multicollinearity:
Customer Demographics and Relationships:
Credit Usage:
Transaction Changes:
Attrition Factors:
# Relationship between all numeric columns
sns.pairplot(data=bankchurn_df,corner=True,hue='Attrition_Flag',height = 1.5);
# Get the 25th percentile for each numeric column
Quarter1 = bankchurn_df.select_dtypes(include=["float64", "int64"]).quantile(0.25)
# Get the 75th percentile for each numeric column
Quarter3 = bankchurn_df.select_dtypes(include=["float64", "int64"]).quantile(0.75)
# Calculate the Interquartile Range (75th percentile - 25th percentile)
IQR = Quarter3 - Quarter1
# Find lower and upper bounds for all values. All values outside these bounds are outliers
lower = Quarter1 - 1.5 * IQR
upper = Quarter3 + 1.5 * IQR
# Display the 25th percentiles (lower quartile) for each numeric column
Quarter1
| 0.25 | |
|---|---|
| Attrition_Flag | 0.000 |
| Customer_Age | 41.000 |
| Dependent_count | 1.000 |
| Months_on_book | 31.000 |
| Total_Relationship_Count | 3.000 |
| Months_Inactive_12_mon | 2.000 |
| Contacts_Count_12_mon | 2.000 |
| Credit_Limit | 2555.000 |
| Total_Revolving_Bal | 359.000 |
| Avg_Open_To_Buy | 1324.500 |
| Total_Amt_Chng_Q4_Q1 | 0.631 |
| Total_Trans_Amt | 2155.500 |
| Total_Trans_Ct | 45.000 |
| Total_Ct_Chng_Q4_Q1 | 0.582 |
| Avg_Utilization_Ratio | 0.023 |
# Display the 75th percentiles (upper quartile) for each numeric column
Quarter3
| 0.75 | |
|---|---|
| Attrition_Flag | 0.000 |
| Customer_Age | 52.000 |
| Dependent_count | 3.000 |
| Months_on_book | 40.000 |
| Total_Relationship_Count | 5.000 |
| Months_Inactive_12_mon | 3.000 |
| Contacts_Count_12_mon | 3.000 |
| Credit_Limit | 11067.500 |
| Total_Revolving_Bal | 1784.000 |
| Avg_Open_To_Buy | 9859.000 |
| Total_Amt_Chng_Q4_Q1 | 0.859 |
| Total_Trans_Amt | 4741.000 |
| Total_Trans_Ct | 81.000 |
| Total_Ct_Chng_Q4_Q1 | 0.818 |
| Avg_Utilization_Ratio | 0.503 |
# Check the Outlier percentage for each numeric column
((bankchurn_df.select_dtypes(include=["float64", "int64"]) < lower)
|(bankchurn_df.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(bankchurn_df) * 100
| 0 | |
|---|---|
| Attrition_Flag | 16.065962 |
| Customer_Age | 0.019749 |
| Dependent_count | 0.000000 |
| Months_on_book | 3.811593 |
| Total_Relationship_Count | 0.000000 |
| Months_Inactive_12_mon | 3.268490 |
| Contacts_Count_12_mon | 6.211119 |
| Credit_Limit | 9.716599 |
| Total_Revolving_Bal | 0.000000 |
| Avg_Open_To_Buy | 9.509233 |
| Total_Amt_Chng_Q4_Q1 | 3.910339 |
| Total_Trans_Amt | 8.847635 |
| Total_Trans_Ct | 0.019749 |
| Total_Ct_Chng_Q4_Q1 | 3.890590 |
| Avg_Utilization_Ratio | 0.000000 |
# Replace the "abc" income category with NaN for imputation
bankchurn_df['Income_Category'] = bankchurn_df['Income_Category'].replace('abc', np.nan)
# Get the proportion of all categorical columns values corresponding to Attrition Flag
print(bankchurn_df.groupby(['Attrition_Flag','Education_Level'])['Education_Level'].count() / bankchurn_df.groupby(['Attrition_Flag'])['Education_Level'].count()*100)
print(bankchurn_df.groupby(['Attrition_Flag','Card_Category'])['Card_Category'].count() / bankchurn_df.groupby(['Attrition_Flag'])['Card_Category'].count()*100)
print(bankchurn_df.groupby(['Attrition_Flag','Marital_Status'])['Marital_Status'].count() / bankchurn_df.groupby(['Attrition_Flag'])['Marital_Status'].count()*100)
print(bankchurn_df.groupby(['Attrition_Flag','Income_Category'])['Income_Category'].count() / bankchurn_df.groupby(['Attrition_Flag'])['Income_Category'].count()*100)
print(bankchurn_df.groupby(['Attrition_Flag','Gender'])['Gender'].count() / bankchurn_df.groupby(['Attrition_Flag'])['Gender'].count()*100)
Attrition_Flag Education_Level
0 College 11.869559
Doctorate 4.919165
Graduate 36.493022
High School 23.587122
Post-Graduate 5.858781
Uneducated 17.272350
1 College 11.232677
Doctorate 6.929249
Graduate 35.521517
High School 22.319475
Post-Graduate 6.710430
Uneducated 17.286652
Name: Education_Level, dtype: float64
Attrition_Flag Card_Category
0 Blue 93.141176
Gold 1.117647
Platinum 0.176471
Silver 5.564706
1 Blue 93.362016
Gold 1.290719
Platinum 0.307314
Silver 5.039951
Name: Card_Category, dtype: float64
Attrition_Flag Marital_Status
0 Divorced 7.956853
Married 50.482234
Single 41.560914
1 Divorced 8.077437
Married 47.329773
Single 44.592790
Name: Marital_Status, dtype: float64
Attrition_Flag Income_Category
0 $120K + 7.933993
$40K - $60K 20.052805
$60K - $80K 16.013201
$80K - $120K 17.069307
Less than $40K 38.930693
1 $120K + 8.750000
$40K - $60K 18.819444
$60K - $80K 13.125000
$80K - $120K 16.805556
Less than $40K 42.500000
Name: Income_Category, dtype: float64
Attrition_Flag Gender
0 F 52.094118
M 47.905882
1 F 57.160418
M 42.839582
Name: Gender, dtype: float64
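The chained `groupby` ratios above can also be computed in a single call with `pd.crosstab` and `normalize="index"`. A minimal sketch on a hypothetical five-row mini-frame (not the bank data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for bankchurn_df
df = pd.DataFrame({
    "Attrition_Flag": [0, 0, 0, 1, 1],
    "Gender": ["F", "M", "F", "F", "M"],
})

# normalize="index" gives row-wise proportions per Attrition_Flag,
# matching the groupby-count ratio used above
pct = pd.crosstab(df["Attrition_Flag"], df["Gender"], normalize="index") * 100
```

This returns one tidy DataFrame per categorical column instead of a printed Series.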
# Imputation requires numerical inputs, so we label encode each categorical column
gender = {
"M": 0,
"F": 1
}
education_level = {
"Post-Graduate": 0,
"Doctorate": 1,
"College": 2,
"Uneducated": 3,
"High School": 4,
"Graduate":5
}
marital_Status = {
"Divorced": 0,
"Single": 1,
"Married": 2
}
income_category = {
"$120K": 0,
"$40K - $60K": 3,
"$60K - $80K": 1,
"$80K - $120K": 2,
"Less than $40K": 4
}
card_category = {
"Blue": 3,
"Silver": 2,
"Gold": 1,
"Platinum": 0
}
bankchurn_df['Gender'] = bankchurn_df['Gender'].map(gender)
bankchurn_df['Education_Level'] = bankchurn_df['Education_Level'].map(education_level)
bankchurn_df['Marital_Status'] = bankchurn_df['Marital_Status'].map(marital_Status)
bankchurn_df['Income_Category'] = bankchurn_df['Income_Category'].map(income_category)
bankchurn_df['Card_Category'] = bankchurn_df['Card_Category'].map(card_category)
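One caveat with `.map()`: any category missing from the dictionary is silently converted to NaN, so it is worth checking for unmapped values afterwards. A small sketch using the `Card_Category` mapping plus one made-up stray value ("Titanium" is hypothetical, not in the data):

```python
import pandas as pd

# "Titanium" is a deliberately stray value to show the pitfall
s = pd.Series(["Blue", "Silver", "Gold", "Platinum", "Titanium"])
card_category = {"Blue": 3, "Silver": 2, "Gold": 1, "Platinum": 0}

mapped = s.map(card_category)
# Any key absent from the dict becomes NaN rather than raising an error
n_unmapped = int(mapped.isna().sum())
```

A quick `mapped.isna().sum()` after each `.map()` call confirms the dictionary covered every spelling in the column.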
# defining the explanatory (independent) and response (dependent) variables
X = bankchurn_df.drop(['Attrition_Flag'], axis=1)
y = bankchurn_df['Attrition_Flag']
# splitting the data in an 80:20 ratio for train and Test sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=RS)
# splitting the data in a 75:25 ratio for train and Validation sets
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=RS)
# Print the shape of Train, Validation and Test data sets along with class percentages to display the proportion distribution
print("Shape of training set:", X_train.shape, y_train.shape)
print("Shape of validation set:", X_val.shape, y_val.shape)
print("Shape of test set:", X_test.shape, y_test.shape)
print('\n')
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in validation set:")
print(100*y_val.value_counts(normalize=True), '\n')
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
Shape of training set: (6075, 19) (6075,)
Shape of validation set: (2026, 19) (2026,)
Shape of test set: (2026, 19) (2026,)

Percentage of classes in training set:
Attrition_Flag
0    83.934156
1    16.065844
Name: proportion, dtype: float64

Percentage of classes in validation set:
Attrition_Flag
0    83.909181
1    16.090819
Name: proportion, dtype: float64

Percentage of classes in test set:
Attrition_Flag
0    83.958539
1    16.041461
Name: proportion, dtype: float64
#Print shape of Train, Validation and Test data set
print("Shape of training set:", X_train.shape)
print("Shape of validation set:", X_val.shape)
print("Shape of test set:", X_test.shape)
Shape of training set: (6075, 19)
Shape of validation set: (2026, 19)
Shape of test set: (2026, 19)
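The two-step stratified split above (80:20, then 75:25 of the remainder) produces a 60:20:20 train/validation/test partition while keeping the class ratio nearly identical in each part. A toy sketch with a ~16%-positive target like `Attrition_Flag` (synthetic data, not the bank data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 16 positives out of 100, like the churn rate above
y = np.array([1] * 16 + [0] * 84)
X = np.arange(len(y)).reshape(-1, 1)

# 80:20 then 75:25 reproduces the 60:20:20 split used in this notebook
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=1)
```

With `stratify` set, each of the three partitions carries roughly the same positive rate as the full data.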
# Initialize the MinMaxScaler
from sklearn.preprocessing import MinMaxScaler  # needed if not imported earlier
scaler = MinMaxScaler()
# Fit the scaler on the training data and transform it
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
# Transform the validation and test data using the same scaler
X_val = pd.DataFrame(scaler.transform(X_val),columns=X_val.columns)
X_test = pd.DataFrame(scaler.transform(X_test),columns=X_test.columns)
# Print the scaled data
print("Scaled Training Data:\n", X_train)
print("Scaled Validation Data:\n", X_val)
print("Scaled Test Data:\n", X_test)
Scaled Training Data:
Customer_Age Gender Dependent_count Education_Level Marital_Status \
0 0.297872 0.0 0.4 NaN 0.5
1 0.382979 0.0 0.2 NaN 1.0
2 0.468085 0.0 0.8 0.8 1.0
3 0.319149 0.0 0.4 1.0 NaN
4 0.425532 0.0 0.8 0.8 0.0
... ... ... ... ... ...
6070 0.446809 1.0 0.4 1.0 0.5
6071 0.446809 0.0 0.6 0.6 0.5
6072 0.510638 1.0 0.0 0.0 0.0
6073 0.404255 1.0 0.8 0.6 1.0
6074 0.617021 1.0 0.4 0.4 NaN
Income_Category Card_Category Months_on_book \
0 NaN 1.000000 0.186047
1 NaN 1.000000 0.488372
2 0.333333 1.000000 0.534884
3 0.000000 0.666667 0.534884
4 0.666667 0.666667 0.534884
... ... ... ...
6070 1.000000 1.000000 0.534884
6071 NaN 1.000000 0.534884
6072 1.000000 1.000000 0.534884
6073 1.000000 1.000000 0.581395
6074 NaN 1.000000 0.651163
Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon \
0 1.0 0.666667 0.500000
1 1.0 0.333333 0.000000
2 0.8 0.166667 0.333333
3 1.0 0.333333 0.000000
4 0.2 0.333333 0.500000
... ... ... ...
6070 1.0 0.333333 0.333333
6071 0.4 0.500000 0.666667
6072 0.4 0.666667 0.166667
6073 0.4 0.666667 0.500000
6074 1.0 0.333333 0.333333
Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy \
0 0.562847 0.636472 0.534516
1 0.043736 0.752880 0.028401
2 0.162034 1.000000 0.123776
3 0.772777 0.000000 0.782183
4 0.411023 0.538737 0.396105
... ... ... ...
6070 0.112060 0.000000 0.148815
6071 1.000000 0.000000 1.000000
6072 0.061362 0.882400 0.035849
6073 0.104895 0.000000 0.141946
6074 0.186673 0.000000 0.220338
Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct \
0 0.174206 0.065483 0.279070
1 0.144673 0.047624 0.162791
2 0.326355 0.212362 0.534884
3 0.228037 0.038890 0.224806
4 0.281869 0.402081 0.573643
... ... ... ...
6070 0.259065 0.059976 0.193798
6071 0.193271 0.084010 0.178295
6072 0.292710 0.762880 0.868217
6073 0.371215 0.017637 0.085271
6074 0.254206 0.209080 0.457364
Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 0.143511 0.080402
1 0.170167 0.660302
2 0.237211 0.371859
3 0.080775 0.000000
4 0.201939 0.090452
... ... ...
6070 0.254173 0.000000
6071 0.100969 0.000000
6072 0.168821 0.643216
6073 0.201939 0.000000
6074 0.195207 0.000000
[6075 rows x 19 columns]
Scaled Validation Data:
Customer_Age Gender Dependent_count Education_Level Marital_Status \
0 0.234043 0.0 0.0 0.0 0.5
1 0.680851 0.0 0.4 0.6 0.5
2 0.340426 0.0 0.6 0.6 1.0
3 0.446809 0.0 0.6 NaN 1.0
4 0.723404 1.0 0.2 0.8 0.5
... ... ... ... ... ...
2021 0.276596 0.0 0.4 0.6 0.5
2022 0.574468 1.0 0.0 1.0 0.5
2023 0.404255 1.0 0.8 1.0 1.0
2024 0.574468 1.0 0.4 1.0 0.5
2025 0.255319 1.0 0.0 1.0 0.5
Income_Category Card_Category Months_on_book \
0 0.333333 1.0 0.325581
1 0.333333 1.0 0.767442
2 NaN 0.0 0.232558
3 0.333333 1.0 0.534884
4 1.000000 1.0 0.534884
... ... ... ...
2021 0.000000 1.0 0.534884
2022 1.000000 1.0 0.627907
2023 NaN 1.0 0.511628
2024 0.666667 1.0 0.674419
2025 1.000000 1.0 0.441860
Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon \
0 0.8 0.333333 0.500000
1 0.0 0.500000 0.166667
2 0.4 0.666667 0.500000
3 0.4 0.333333 0.500000
4 0.8 0.333333 0.333333
... ... ... ...
2021 0.8 0.500000 0.500000
2022 0.6 0.166667 0.666667
2023 1.0 0.500000 0.000000
2024 1.0 0.333333 0.000000
2025 1.0 0.500000 0.500000
Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy \
0 0.419851 0.000000 0.443865
1 0.267482 0.000000 0.297803
2 1.000000 0.822408 0.940010
3 0.249253 0.443385 0.247986
4 0.037781 0.642431 0.030748
... ... ... ...
2021 0.378584 0.516091 0.366661
2022 0.000000 0.000000 0.041393
2023 0.041197 0.811681 0.021677
2024 0.117714 0.000000 0.154234
2025 0.009574 0.297974 0.028836
Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct \
0 0.433271 0.137977 0.348837
1 0.339439 0.427785 0.379845
2 0.328972 0.738344 0.713178
3 0.269533 0.841048 0.728682
4 0.206355 0.204351 0.472868
... ... ... ...
2021 0.314766 0.015189 0.077519
2022 0.188037 0.083676 0.333333
2023 0.217944 0.040670 0.279070
2024 0.351028 0.047903 0.224806
2025 0.246355 0.249805 0.534884
Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 0.259558 0.000000
1 0.243134 0.000000
2 0.146742 0.060302
3 0.222132 0.115578
4 0.165320 0.605025
... ... ...
2021 0.179591 0.093467
2022 0.127087 0.000000
2023 0.173129 0.732663
2024 0.314216 0.000000
2025 0.225363 0.429146
[2026 rows x 19 columns]
Scaled Test Data:
Customer_Age Gender Dependent_count Education_Level Marital_Status \
0 0.127660 0.0 0.2 0.8 0.5
1 0.510638 0.0 0.2 0.0 0.5
2 0.595745 1.0 0.4 0.8 1.0
3 0.744681 0.0 0.0 0.6 1.0
4 0.319149 1.0 0.6 0.4 1.0
... ... ... ... ... ...
2021 0.127660 0.0 0.0 0.0 1.0
2022 0.127660 1.0 0.0 0.8 0.0
2023 0.404255 0.0 0.4 NaN 1.0
2024 0.148936 1.0 0.8 0.4 1.0
2025 0.617021 1.0 0.4 NaN 1.0
Income_Category Card_Category Months_on_book \
0 0.333333 1.0 0.302326
1 0.000000 1.0 0.534884
2 0.666667 1.0 0.534884
3 NaN 1.0 0.534884
4 0.666667 1.0 0.046512
... ... ... ...
2021 0.000000 1.0 0.302326
2022 1.000000 1.0 0.534884
2023 NaN 1.0 0.627907
2024 0.666667 1.0 0.302326
2025 1.000000 1.0 0.558140
Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon \
0 0.2 0.500000 0.333333
1 0.6 0.500000 0.333333
2 0.4 0.500000 0.500000
3 0.6 0.500000 0.666667
4 0.8 0.500000 0.666667
... ... ... ...
2021 1.0 0.333333 0.500000
2022 0.8 0.333333 0.666667
2023 0.6 0.666667 0.666667
2024 0.0 0.333333 0.500000
2025 0.4 0.500000 0.166667
Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy \
0 0.150213 0.448947 0.152640
1 0.026565 0.000000 0.066858
2 0.074180 0.000000 0.112502
3 0.687282 1.000000 0.627282
4 0.086877 1.000000 0.051730
... ... ... ...
2021 0.122128 0.592372 0.115255
2022 0.038113 0.639253 0.031299
2023 0.454617 0.545888 0.437373
2024 0.043404 1.000000 0.010056
2025 0.030676 0.384585 0.042746
Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct \
0 0.282617 0.776733 0.643411
1 0.274393 0.094804 0.240310
2 0.275888 0.211862 0.496124
3 0.158505 0.063870 0.131783
4 0.277009 0.121453 0.356589
... ... ... ...
2021 0.487477 0.108267 0.348837
2022 0.333832 0.090742 0.356589
2023 0.171589 0.234060 0.472868
2024 0.272897 0.238344 0.604651
2025 0.404112 0.095026 0.310078
Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 0.162359 0.176884
1 0.139742 0.000000
2 0.205170 0.000000
3 0.134626 0.104523
4 0.117394 0.586935
... ... ...
2021 0.153743 0.273367
2022 0.217017 0.598995
2023 0.185784 0.083417
2024 0.161551 0.880402
2025 0.165051 0.396985
[2026 rows x 19 columns]
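Note that the scaler is fit on the training data only and then reused on the validation and test sets, which avoids data leakage; a consequence is that values outside the training range scale outside [0, 1]. A minimal sketch with toy numbers (not the bank data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[20.0]])  # value outside the training range

scaler = MinMaxScaler()
scaler.fit(train)                      # min/max learned from training data only
train_scaled = scaler.transform(train) # maps the training range onto [0, 1]
test_scaled = scaler.transform(test)   # unseen values can exceed 1
```

Fitting a fresh scaler on validation or test data would leak their distributions into preprocessing and make the evaluation optimistic.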
X_train.isna().sum()
| 0 | |
|---|---|
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 928 |
| Marital_Status | 457 |
| Income_Category | 1103 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
# Initialize the KNN Imputer
from sklearn.impute import KNNImputer  # needed if not imported earlier
imputer = KNNImputer(n_neighbors=5)
# Fit the imputer on the training data and transform it
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
# Transform the validation and test data using the same imputer
X_val = pd.DataFrame(imputer.transform(X_val),columns=X_val.columns)
X_test = pd.DataFrame(imputer.transform(X_test),columns=X_test.columns)
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
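KNNImputer fills each missing cell with the mean of that feature over the k nearest rows, where nearness is measured with a NaN-aware Euclidean distance over the observed features. A tiny sketch with `n_neighbors=2` (toy array, not the bank data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Two complete rows and one row with a missing value in column 1
X = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, np.nan]])

imputer = KNNImputer(n_neighbors=2)
# The missing cell is filled with the mean of its 2 nearest neighbours'
# values in that column: (2.0 + 4.0) / 2 = 3.0
X_filled = imputer.fit_transform(X)
```

This is why the features were scaled first: KNN distances are dominated by whichever feature has the largest raw range.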
The nature of predictions made by the classification model will translate as follows:
- Predicting a customer will attrite when they actually stay (false positive): the bank spends retention effort and incentives on a customer who was not going to leave.
- Predicting a customer will stay when they actually attrite (false negative): the bank loses the customer and the associated fee income.
Which metric to optimize?
Losing a customer is the costlier error here, so we want to minimize false negatives, i.e. maximize Recall on the attrited class.
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
def plot_confusion_matrix(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
# Predict the target values using the provided model and predictors
y_pred = model.predict(predictors)
# Compute the confusion matrix comparing the true target values with the predicted values
cm = confusion_matrix(target, y_pred)
# Create labels for each cell in the confusion matrix with both count and percentage
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2) # reshaping to a matrix
# Set the figure size for the plot
plt.figure(figsize=(6, 4))
# Plot the confusion matrix as a heatmap with the labels
sns.heatmap(cm, annot=labels, fmt="")
# Add a label to the y-axis
plt.ylabel("True label")
# Add a label to the x-axis
plt.xlabel("Predicted label")
# Import the model classes compared below (needed if not imported earlier)
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=RS)))
models.append(("Random forest", RandomForestClassifier(random_state=RS)))
models.append(("GBM", GradientBoostingClassifier(random_state=RS)))
models.append(("Adaboost", AdaBoostClassifier(random_state=RS)))
models.append(("Xgboost", XGBClassifier(random_state=RS, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=RS)))
models_train_score = [] # Empty list to store the training scores
models_validation_score = [] # Empty list to store the validation scores
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train))
models_train_score.append((name,scores))
scores_val = recall_score(y_val, model.predict(X_val))
models_validation_score.append((name,scores_val))
# Convert the score lists into two-column data frames
models_train_score_df = pd.DataFrame(models_train_score, columns=['Model Name', 'Score'])
models_validation_score_df = pd.DataFrame(models_validation_score, columns=['Model Name', 'Score'])
# merge models_train_score_df and models_validation_score_df into one data frame
models_score_df = pd.merge(models_train_score_df, models_validation_score_df, on='Model Name', suffixes=('_train_original', '_val_original'))
# Print the original count of labels in the training set, where Existing Customer = 0 and Attrited Customer = 1
print("Original Count of Labels \n")
print("Total counts of Attrited Customer : {}".format(sum(y_train == 1)))
print("Total counts of Existing Customer: {} \n".format(sum(y_train == 0)))
Original Count of Labels

Total counts of Attrited Customer : 976
Total counts of Existing Customer: 5099
# Create synthetic data to treat the imbalanced class distribution of the target variable
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=RS
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# Print the Oversampled Count of Labels in training set where Existing Customer = 0 and Attrited Customer = 1
print("Oversampled Count of Labels \n")
print("Total counts of Attrited Customer : {}".format(sum(y_train_over == 1)))
print("Total counts of Existing Customer: {} \n".format(sum(y_train_over == 0)))
#Print the shape of oversampled data set
print("After Oversampling\n")
print("The shape of train_X: {}".format(X_train_over.shape))
print("The shape of train_y: {} \n".format(y_train_over.shape))
Oversampled Count of Labels

Total counts of Attrited Customer : 5099
Total counts of Existing Customer: 5099

After Oversampling

The shape of train_X: (10198, 19)
The shape of train_y: (10198,)
models_train_score_over = [] # Empty list to store the training scores
models_validation_score_over = [] # Empty list to store the validation scores
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_train_over, model.predict(X_train_over))
models_train_score_over.append((name,scores))
scores_val = recall_score(y_val, model.predict(X_val))
models_validation_score_over.append((name,scores_val))
# Convert the score lists into two-column data frames
models_train_score_over_df = pd.DataFrame(models_train_score_over, columns=['Model Name', 'Score'])
models_validation_score_over_df = pd.DataFrame(models_validation_score_over, columns=['Model Name', 'Score'])
# merge models_train_score_over_df and models_validation_score_over_df into one data frame
models_score_over_df = pd.merge(models_train_score_over_df, models_validation_score_over_df, on='Model Name', suffixes=('_train_oversampled', '_val_oversampled'))
# Print the original count of labels in the training set, where Existing Customer = 0 and Attrited Customer = 1
print("Original Count of Labels \n")
print("Total counts of Attrited Customer : {}".format(sum(y_train == 1)))
print("Total counts of Existing Customer: {} \n".format(sum(y_train == 0)))
Original Count of Labels

Total counts of Attrited Customer : 976
Total counts of Existing Customer: 5099
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=RS, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
# Print the Undersampled Count of Labels in training set where Existing Customer = 0 and Attrited Customer = 1
print("Undersampled Count of Labels \n")
print("Total counts of Attrited Customer: {}".format(sum(y_train_un == 1)))
print("Total counts of Existing Customer: {} \n".format(sum(y_train_un == 0)))
# Print the shape of the undersampled data set
print("After Undersampling\n")
print("The shape of train_X: {}".format(X_train_un.shape))
print("The shape of train_y: {} \n".format(y_train_un.shape))
Undersampled Count of Labels

Total counts of Attrited Customer: 976
Total counts of Existing Customer: 976

After Undersampling

The shape of train_X: (1952, 19)
The shape of train_y: (1952,)
models_train_score_under = [] # Empty list to store the training scores
models_validation_score_under = [] # Empty list to store the validation scores
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_train_un, model.predict(X_train_un))
models_train_score_under.append((name,scores))
scores_val = recall_score(y_val, model.predict(X_val))
models_validation_score_under.append((name,scores_val))
# Convert the score lists into two-column data frames
models_train_score_under_df = pd.DataFrame(models_train_score_under, columns=['Model Name', 'Score'])
models_validation_score_under_df = pd.DataFrame(models_validation_score_under, columns=['Model Name', 'Score'])
# merge models_train_score_under_df and models_validation_score_under_df into one data frame
models_score_under_df = pd.merge(models_train_score_under_df, models_validation_score_under_df, on='Model Name', suffixes=('_train_undersampled', '_val_undersampled'))
# merge models_score_df, models_score_over_df and models_score_under_df into one data frame
print("Recall Score comparison on Training and Validation score for basic models\n")
models_score_final_df = pd.merge(models_score_df, models_score_over_df, on='Model Name')
models_score_final_df = pd.merge(models_score_final_df, models_score_under_df, on='Model Name')
models_score_final_df
Recall Score comparison on Training and Validation score for basic models
| Model Name | Score_train_original | Score_val_original | Score_train_oversampled | Score_val_oversampled | Score_train_undersampled | Score_val_undersampled | |
|---|---|---|---|---|---|---|---|
| 0 | Bagging | 0.985656 | 0.815951 | 0.997254 | 0.834356 | 0.989754 | 0.932515 |
| 1 | Random forest | 1.000000 | 0.822086 | 1.000000 | 0.858896 | 1.000000 | 0.941718 |
| 2 | GBM | 0.875000 | 0.861963 | 0.980977 | 0.892638 | 0.979508 | 0.963190 |
| 3 | Adaboost | 0.839139 | 0.858896 | 0.963130 | 0.904908 | 0.952869 | 0.960123 |
| 4 | Xgboost | 1.000000 | 0.889571 | 1.000000 | 0.898773 | 1.000000 | 0.957055 |
| 5 | dtree | 1.000000 | 0.815951 | 1.000000 | 0.794479 | 1.000000 | 0.907975 |
Next Steps
Note
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
param_grid = {
"n_estimators": [50,110,25],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
param_grid = {
'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10, 15],
'min_impurity_decrease': [0.0001,0.001]
}
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# defining DecisionTreeClassifier model
dt_tuned_model = DecisionTreeClassifier(random_state=RS)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV  # needed if not imported earlier
randomized_cv = RandomizedSearchCV(estimator=dt_tuned_model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=RS)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.751941391941392:
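RandomizedSearchCV samples `n_iter` parameter combinations from the grid and scores each with 5-fold cross-validation using the supplied recall scorer, then exposes the winner via `best_params_` and `best_score_`. A self-contained sketch on synthetic data (toy dataset and grid, not the tuning above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, recall_score

# Toy imbalanced classification problem (~20% positives)
X, y = make_classification(n_samples=300, weights=[0.8], random_state=1)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions={"max_depth": np.arange(2, 6)},
    n_iter=3,                               # try 3 random depths out of the 4 candidates
    scoring=make_scorer(recall_score),      # optimize recall, as in this notebook
    cv=3,
    random_state=1,
)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]
```

Unlike GridSearchCV, the cost is fixed by `n_iter` rather than by the full grid size, which is why it scales to the larger parameter grids listed above.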
# Build a DecisionTreeClassifier model with the best parameters from randomized_cv
dt_tuned_model = DecisionTreeClassifier(random_state=RS,**randomized_cv.best_params_)
# Train model with original data set
dt_tuned_model.fit(X_train,y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, min_samples_leaf=7,
                       random_state=1)
# Calculating different metrics on train set
dt_train = model_performance_classification_sklearn(
dt_tuned_model, X_train, y_train
)
print("Training performance:")
dt_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.938765 | 0.805328 | 0.811983 | 0.808642 |
# Calculating different metrics on validation set
dt_val = model_performance_classification_sklearn(
dt_tuned_model, X_val, y_val
)
print("Validation performance:")
dt_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.929911 | 0.782209 | 0.782209 | 0.782209 |
# Train model with Oversampled data set
dt_tuned_model.fit(X_train_over,y_train_over)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, min_samples_leaf=7,
                       random_state=1)
# Calculating different metrics on train set
dt_train_over = model_performance_classification_sklearn(
dt_tuned_model, X_train_over, y_train_over
)
print("Training performance:")
dt_train_over
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.916356 | 0.917827 | 0.915135 | 0.916479 |
# Calculating different metrics on validation set
dt_val_over = model_performance_classification_sklearn(
dt_tuned_model, X_val, y_val
)
print("Validation performance:")
dt_val_over
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.911649 | 0.892638 | 0.668966 | 0.764783 |
# Train model with Undersampled data set
dt_tuned_model.fit(X_train_un,y_train_un)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, min_samples_leaf=7,
                       random_state=1)
# Calculating different metrics on train set
dt_train_under = model_performance_classification_sklearn(
dt_tuned_model, X_train_un, y_train_un
)
print("Training performance:")
dt_train_under
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.913934 | 0.939549 | 0.893762 | 0.916084 |
# Calculating different metrics on validation set
dt_val_under = model_performance_classification_sklearn(
dt_tuned_model, X_val, y_val
)
print("Validation performance:")
dt_val_under
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.867226 | 0.920245 | 0.552486 | 0.690449 |
# defining BaggingClassifier model
bag_tuned_model = BaggingClassifier(random_state=RS)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
#Calling RandomizedSearchCV
bag_randomized_cv = RandomizedSearchCV(estimator=bag_tuned_model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=RS)
#Fitting parameters in RandomizedSearchCV
bag_randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(bag_randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.9} with CV score=0.751941391941392:
# Build a BaggingClassifier model with the best parameters from bag_randomized_cv
bag_tuned_model = BaggingClassifier(random_state=RS,**bag_randomized_cv.best_params_)
# Train model with original data set
bag_tuned_model.fit(X_train,y_train)
BaggingClassifier(max_features=0.9, max_samples=0.8, n_estimators=70,
                  random_state=1)
# Calculating different metrics on train set
bag_train = model_performance_classification_sklearn(
bag_tuned_model, X_train, y_train
)
print("Training performance:")
bag_train
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.999506 | 0.997951 | 0.998974 | 0.998462 |
# Calculating different metrics on Validation set
bag_val = model_performance_classification_sklearn(
bag_tuned_model, X_val, y_val
)
print("Validation performance:")
bag_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.964462 | 0.846626 | 0.926174 | 0.884615 |
# Train model with Oversampled data set
bag_tuned_model.fit(X_train_over,y_train_over)
BaggingClassifier(max_features=0.9, max_samples=0.8, n_estimators=70,
                  random_state=1)
# Calculating different metrics on train set
bag_train_over = model_performance_classification_sklearn(
bag_tuned_model, X_train_over, y_train_over
)
print("Training performance:")
bag_train_over
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.999902 | 1.0 | 0.999804 | 0.999902 |
# Calculating different metrics on Validation set
bag_val_over = model_performance_classification_sklearn(
bag_tuned_model, X_val, y_val
)
print("Validation performance:")
bag_val_over
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.950642 | 0.877301 | 0.82659 | 0.85119 |
# Train model with Undersampled data set
bag_tuned_model.fit(X_train_un,y_train_un)
BaggingClassifier(max_features=0.9, max_samples=0.8, n_estimators=70,
                  random_state=1)
# Calculating different metrics on train set
bag_train_under = model_performance_classification_sklearn(
bag_tuned_model, X_train_un, y_train_un
)
print("Training performance:")
bag_train_under
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on Validation set
bag_val_under = model_performance_classification_sklearn(
bag_tuned_model, X_val, y_val
)
print("Validation performance:")
bag_val_under
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92843 | 0.947853 | 0.707094 | 0.809961 |
# defining RandomForest model
rf_tuned_model = RandomForestClassifier(random_state=RS)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"n_estimators": [50,110,25],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=rf_tuned_model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=RS)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 110, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.766363160648875:
# Build a RandomForestClassifier model with the best parameters from randomized_cv
rf_tuned_model = RandomForestClassifier(random_state=RS,**randomized_cv.best_params_)
# Train model with original data set
rf_tuned_model.fit(X_train,y_train)
RandomForestClassifier(max_samples=0.6, n_estimators=110, random_state=1)
# Calculating different metrics on train set
rf_train = model_performance_classification_sklearn(
rf_tuned_model, X_train, y_train
)
print("Training performance:")
rf_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.996214 | 0.978484 | 0.99791 | 0.988101 |
# Calculating different metrics on Validation set
rf_val = model_performance_classification_sklearn(
rf_tuned_model, X_val, y_val
)
print("Validation performance:")
rf_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.959033 | 0.794479 | 0.941818 | 0.861897 |
# Train model with Oversampled data set
rf_tuned_model.fit(X_train_over,y_train_over)
RandomForestClassifier(max_samples=0.6, n_estimators=110, random_state=1)
# Calculating different metrics on train set
rf_train_over = model_performance_classification_sklearn(
rf_tuned_model, X_train_over, y_train_over
)
print("Training performance:")
rf_train_over
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.999019 | 0.999804 | 0.998238 | 0.99902 |
# Calculating different metrics on Validation set
rf_val_over = model_performance_classification_sklearn(
rf_tuned_model, X_val, y_val
)
print("Validation performance:")
rf_val_over
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.953603 | 0.871166 | 0.845238 | 0.858006 |
# Train model with Undersampled data set
rf_tuned_model.fit(X_train_un,y_train_un)
RandomForestClassifier(max_samples=0.6, n_estimators=110, random_state=1)
# Calculating different metrics on train set
rf_train_under = model_performance_classification_sklearn(
rf_tuned_model, X_train_un, y_train_un
)
print("Training performance:")
rf_train_under
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.996926 | 0.998975 | 0.994898 | 0.996933 |
# Calculating different metrics on Validation set
rf_val_under = model_performance_classification_sklearn(
rf_tuned_model, X_val, y_val
)
print("Validation performance:")
rf_val_under
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.929418 | 0.920245 | 0.719424 | 0.807537 |
# defining Adaboost model
ad_tuned_model = AdaBoostClassifier(random_state=RS)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
#Calling RandomizedSearchCV for Adaboost
adaboost_cv = RandomizedSearchCV(estimator=ad_tuned_model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=RS)
#Fitting parameters in RandomizedSearchCV
adaboost_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(adaboost_cv.best_params_,adaboost_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.836064887493459:
# Build an AdaBoost model with the best parameters from adaboost_cv
ad_tuned_model = AdaBoostClassifier(random_state=RS,**adaboost_cv.best_params_)
# Train model with original data set
ad_tuned_model.fit(X_train,y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
ad_train = model_performance_classification_sklearn(
ad_tuned_model, X_train, y_train
)
print("Training performance:")
ad_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982716 | 0.92623 | 0.964781 | 0.945112 |
# Calculating different metrics on Validation set
ad_val = model_performance_classification_sklearn(
ad_tuned_model, X_val, y_val
)
print("Validation performance:")
ad_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.969891 | 0.871166 | 0.937294 | 0.903021 |
# Train model with Oversampled data set
ad_tuned_model.fit(X_train_over,y_train_over)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
ad_train_over = model_performance_classification_sklearn(
ad_tuned_model, X_train_over, y_train_over
)
print("Training performance:")
ad_train_over
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.987645 | 0.983918 | 0.991306 | 0.987598 |
# Calculating different metrics on Validation set
ad_val_over = model_performance_classification_sklearn(
ad_tuned_model, X_val, y_val
)
print("Validation performance:")
ad_val_over
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.969398 | 0.886503 | 0.920382 | 0.903125 |
# Train model with Undersampled data set
ad_tuned_model.fit(X_train_un,y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
# Calculating different metrics on train set
ad_train_under = model_performance_classification_sklearn(
ad_tuned_model, X_train_un, y_train_un
)
print("Training performance:")
ad_train_under
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.991291 | 0.996926 | 0.985816 | 0.99134 |
# Calculating different metrics on Validation set
ad_val_under = model_performance_classification_sklearn(
ad_tuned_model, X_val, y_val
)
print("Validation performance:")
ad_val_under
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.935341 | 0.96319 | 0.725173 | 0.827404 |
# defining GradientBoosting model
gb_tuned_model = GradientBoostingClassifier(random_state=RS)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
#Calling RandomizedSearchCV for GradientBoosting
gboost_cv = RandomizedSearchCV(estimator=gb_tuned_model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=RS)
#Fitting parameters in RandomizedSearchCV
gboost_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(gboost_cv.best_params_,gboost_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)} with CV score=0.7694348508634222:
# Build a GradientBoosting model with the best parameters from gboost_cv
gb_tuned_model = GradientBoostingClassifier(random_state=RS,**gboost_cv.best_params_)
# Train model with original data set
gb_tuned_model.fit(X_train,y_train)
GradientBoostingClassifier(init=DecisionTreeClassifier(random_state=1),
                           learning_rate=0.05, max_features=0.7, random_state=1,
                           subsample=0.7)
# Calculating different metrics on train set
gb_train = model_performance_classification_sklearn(
gb_tuned_model, X_train, y_train
)
print("Training performance:")
gb_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on Validation set
gb_val = model_performance_classification_sklearn(
gb_tuned_model, X_val, y_val
)
print("Validation performance:")
gb_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.938796 | 0.815951 | 0.806061 | 0.810976 |
# Train model with Oversampled data set
gb_tuned_model.fit(X_train_over,y_train_over)
GradientBoostingClassifier(init=DecisionTreeClassifier(random_state=1),
                           learning_rate=0.05, max_features=0.7, random_state=1,
                           subsample=0.7)
# Calculating different metrics on train set
gb_train_over = model_performance_classification_sklearn(
gb_tuned_model, X_train_over, y_train_over
)
print("Training performance:")
gb_train_over
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on Validation set
gb_val_over = model_performance_classification_sklearn(
gb_tuned_model, X_val, y_val
)
print("Validation performance:")
gb_val_over
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.915104 | 0.794479 | 0.711538 | 0.750725 |
# Train model with Undersampled data set
gb_tuned_model.fit(X_train_un,y_train_un)
GradientBoostingClassifier(init=DecisionTreeClassifier(random_state=1),
                           learning_rate=0.05, max_features=0.7, random_state=1,
                           subsample=0.7)
# Calculating different metrics on train set
gb_train_under = model_performance_classification_sklearn(
gb_tuned_model, X_train_un, y_train_un
)
print("Training performance:")
gb_train_under
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on Validation set
gb_val_under = model_performance_classification_sklearn(
gb_tuned_model, X_val, y_val
)
print("Validation performance:")
gb_val_under
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.890918 | 0.907975 | 0.607803 | 0.728167 |
# defining XGBoost model
xgb_tuned_model = XGBClassifier(random_state=RS, eval_metric="logloss")
# Parameter grid to pass to RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
#Calling RandomizedSearchCV for XGBoost
xgb_cv = RandomizedSearchCV(estimator=xgb_tuned_model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=RS)
#Fitting parameters in RandomizedSearchCV
xgb_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(xgb_cv.best_params_,gboost_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.7694348508634222:
# Build an XGBoost model with the best parameters from xgb_cv
xgb_tuned_model = XGBClassifier(random_state=RS, eval_metric="logloss", **xgb_cv.best_params_)
# Train model with original data set
xgb_tuned_model.fit(X_train,y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=75, n_jobs=None,
              num_parallel_tree=None, random_state=1, ...)
# Calculating different metrics on train set
xgb_train = model_performance_classification_sklearn(
xgb_tuned_model, X_train, y_train
)
print("Training performance:")
xgb_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.975144 | 0.993852 | 0.869955 | 0.927786 |
# Calculating different metrics on Validation set
xgb_val = model_performance_classification_sklearn(
xgb_tuned_model, X_val, y_val
)
print("Validation performance:")
xgb_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.954097 | 0.929448 | 0.812332 | 0.866953 |
# Train model with Oversampled data set
xgb_tuned_model.fit(X_train_over,y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=75, n_jobs=None,
              num_parallel_tree=None, random_state=1, ...)
# Calculating different metrics on train set
xgb_train_over = model_performance_classification_sklearn(
xgb_tuned_model, X_train_over, y_train_over
)
print("Training performance:")
xgb_train_over
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.966072 | 0.999608 | 0.936776 | 0.967173 |
# Calculating different metrics on Validation set
xgb_val_over = model_performance_classification_sklearn(
xgb_tuned_model, X_val, y_val
)
print("Validation performance:")
xgb_val_over
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.925962 | 0.95092 | 0.698198 | 0.805195 |
# Train model with Undersampled data set
xgb_tuned_model.fit(X_train_un,y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=75, n_jobs=None,
              num_parallel_tree=None, random_state=1, ...)
# Calculating different metrics on train set
xgb_train_under = model_performance_classification_sklearn(
xgb_tuned_model, X_train_un, y_train_un
)
print("Training performance:")
xgb_train_under
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.95748 | 1.0 | 0.921624 | 0.959214 |
# Calculating different metrics on Validation set
xgb_val_under = model_performance_classification_sklearn(
xgb_tuned_model, X_val, y_val
)
print("Validation performance:")
xgb_val_under
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.897335 | 0.98773 | 0.612167 | 0.755869 |
# training performance comparison
tuned_models_train_comp_df = pd.concat(
[
dt_train.T,
dt_train_over.T,
dt_train_under.T,
bag_train.T,
bag_train_over.T,
bag_train_under.T,
rf_train.T,
rf_train_over.T,
rf_train_under.T,
ad_train.T,
ad_train_over.T,
ad_train_under.T,
gb_train.T,
gb_train_over.T,
gb_train_under.T,
xgb_train.T,
xgb_train_over.T,
xgb_train_under.T,
],
axis=1,
)
tuned_models_train_comp_df.columns = [
"Decision Tree Tuned with Original Data",
"Decision Tree Tuned with Oversampled Data",
"Decision Tree Tuned with Undersampled Data",
"Bagging Tuned with Original Data",
"Bagging Tuned with Oversampled Data",
"Bagging Tuned with Undersampled Data",
"Random Forest Tuned with Original Data",
"Random Forest Tuned with Oversampled Data",
"Random Forest Tuned with Undersampled Data",
"Adaboost Tuned with Original Data",
"Adaboost Tuned with Oversampled Data",
"Adaboost Tuned with Undersampled Data",
"GradientBoosting Tuned with Original Data",
"GradientBoosting Tuned with Oversampled Data",
"GradientBoosting Tuned with Undersampled Data",
"Xgboost Tuned with Original Data",
"Xgboost Tuned with Oversampled Data",
"Xgboost Tuned with Undersampled Data",
]
# Validation performance comparison
tuned_models_val_comp_df = pd.concat(
[
dt_val.T,
dt_val_over.T,
dt_val_under.T,
bag_val.T,
bag_val_over.T,
bag_val_under.T,
rf_val.T,
rf_val_over.T,
rf_val_under.T,
ad_val.T,
ad_val_over.T,
ad_val_under.T,
gb_val.T,
gb_val_over.T,
gb_val_under.T,
xgb_val.T,
xgb_val_over.T,
xgb_val_under.T,
],
axis=1,
)
tuned_models_val_comp_df.columns = [
"Decision Tree Tuned with Original Data",
"Decision Tree Tuned with Oversampled Data",
"Decision Tree Tuned with Undersampled Data",
"Bagging Tuned with Original Data",
"Bagging Tuned with Oversampled Data",
"Bagging Tuned with Undersampled Data",
"Random Forest Tuned with Original Data",
"Random Forest Tuned with Oversampled Data",
"Random Forest Tuned with Undersampled Data",
"Adaboost Tuned with Original Data",
"Adaboost Tuned with Oversampled Data",
"Adaboost Tuned with Undersampled Data",
"GradientBoosting Tuned with Original Data",
"GradientBoosting Tuned with Oversampled Data",
"GradientBoosting Tuned with Undersampled Data",
"Xgboost Tuned with Original Data",
"Xgboost Tuned with Oversampled Data",
"Xgboost Tuned with Undersampled Data",
]
print("Tuned Models Training performance comparison:")
tuned_models_train_comp_df
Tuned Models Training performance comparison:
| | Decision Tree Tuned with Original Data | Decision Tree Tuned with Oversampled Data | Decision Tree Tuned with Undersampled Data | Bagging Tuned with Original Data | Bagging Tuned with Oversampled Data | Bagging Tuned with Undersampled Data | Random Forest Tuned with Original Data | Random Forest Tuned with Oversampled Data | Random Forest Tuned with Undersampled Data | Adaboost Tuned with Original Data | Adaboost Tuned with Oversampled Data | Adaboost Tuned with Undersampled Data | GradientBoosting Tuned with Original Data | GradientBoosting Tuned with Oversampled Data | GradientBoosting Tuned with Undersampled Data | Xgboost Tuned with Original Data | Xgboost Tuned with Oversampled Data | Xgboost Tuned with Undersampled Data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.938765 | 0.916356 | 0.913934 | 0.999506 | 0.999902 | 1.0 | 0.996214 | 0.999019 | 0.996926 | 0.982716 | 0.987645 | 0.991291 | 1.0 | 1.0 | 1.0 | 0.975144 | 0.966072 | 0.957480 |
| Recall | 0.805328 | 0.917827 | 0.939549 | 0.997951 | 1.000000 | 1.0 | 0.978484 | 0.999804 | 0.998975 | 0.926230 | 0.983918 | 0.996926 | 1.0 | 1.0 | 1.0 | 0.993852 | 0.999608 | 1.000000 |
| Precision | 0.811983 | 0.915135 | 0.893762 | 0.998974 | 0.999804 | 1.0 | 0.997910 | 0.998238 | 0.994898 | 0.964781 | 0.991306 | 0.985816 | 1.0 | 1.0 | 1.0 | 0.869955 | 0.936776 | 0.921624 |
| F1 | 0.808642 | 0.916479 | 0.916084 | 0.998462 | 0.999902 | 1.0 | 0.988101 | 0.999020 | 0.996933 | 0.945112 | 0.987598 | 0.991340 | 1.0 | 1.0 | 1.0 | 0.927786 | 0.967173 | 0.959214 |
print("Tuned Models Validation performance comparison:")
tuned_models_val_comp_df
Tuned Models Validation performance comparison:
| | Decision Tree Tuned with Original Data | Decision Tree Tuned with Oversampled Data | Decision Tree Tuned with Undersampled Data | Bagging Tuned with Original Data | Bagging Tuned with Oversampled Data | Bagging Tuned with Undersampled Data | Random Forest Tuned with Original Data | Random Forest Tuned with Oversampled Data | Random Forest Tuned with Undersampled Data | Adaboost Tuned with Original Data | Adaboost Tuned with Oversampled Data | Adaboost Tuned with Undersampled Data | GradientBoosting Tuned with Original Data | GradientBoosting Tuned with Oversampled Data | GradientBoosting Tuned with Undersampled Data | Xgboost Tuned with Original Data | Xgboost Tuned with Oversampled Data | Xgboost Tuned with Undersampled Data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.929911 | 0.911649 | 0.867226 | 0.964462 | 0.950642 | 0.928430 | 0.959033 | 0.953603 | 0.929418 | 0.969891 | 0.969398 | 0.935341 | 0.938796 | 0.915104 | 0.890918 | 0.954097 | 0.925962 | 0.897335 |
| Recall | 0.782209 | 0.892638 | 0.920245 | 0.846626 | 0.877301 | 0.947853 | 0.794479 | 0.871166 | 0.920245 | 0.871166 | 0.886503 | 0.963190 | 0.815951 | 0.794479 | 0.907975 | 0.929448 | 0.950920 | 0.987730 |
| Precision | 0.782209 | 0.668966 | 0.552486 | 0.926174 | 0.826590 | 0.707094 | 0.941818 | 0.845238 | 0.719424 | 0.937294 | 0.920382 | 0.725173 | 0.806061 | 0.711538 | 0.607803 | 0.812332 | 0.698198 | 0.612167 |
| F1 | 0.782209 | 0.764783 | 0.690449 | 0.884615 | 0.851190 | 0.809961 | 0.861897 | 0.858006 | 0.807537 | 0.903021 | 0.903125 | 0.827404 | 0.810976 | 0.750725 | 0.728167 | 0.866953 | 0.805195 | 0.755869 |
Observations
Comparing the validation Recall scores of all tuned models trained on the original, oversampled, and undersampled datasets, XGBoost emerged as the clear winner.
Among the XGBoost variants, we choose the one trained on the oversampled dataset: its Recall is nearly as high as that of the undersampled variant, while its Precision and F1 scores are noticeably better.
Given the business goal of catching customers likely to attrite, Recall was the primary criterion for model selection, with Precision and F1 as secondary considerations.
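The selection rule above can be sketched on a toy comparison frame. The numbers and index labels below are hypothetical stand-ins for the entries of tuned_models_val_comp_df; the real notebook would transpose that frame and apply the same filter:

```python
import pandas as pd

# Hypothetical validation scores, one row per candidate model
val_scores = pd.DataFrame(
    {
        "Recall": [0.93, 0.95, 0.99],
        "Precision": [0.81, 0.70, 0.61],
        "F1": [0.87, 0.81, 0.76],
    },
    index=["XGBoost (original)", "XGBoost (oversampled)", "XGBoost (undersampled)"],
)

# Recall is the primary criterion: keep only high-recall models,
# then break ties with F1 so Precision is not sacrificed entirely
candidates = val_scores[val_scores["Recall"] >= 0.95]
best_model = candidates["F1"].idxmax()
print(best_model)  # XGBoost (oversampled)
```

Filtering on Recall first and then maximizing F1 encodes the stated priority order rather than blindly taking the single highest-Recall model.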
# Train model with Oversampled data set
xgb_tuned_model.fit(X_train_over,y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=3, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=75, n_jobs=None,
              num_parallel_tree=None, random_state=1, ...)
# Calculating different metrics on Test set for Xgboost Tuned with Oversampled Data
xgb_test_over = model_performance_classification_sklearn(
xgb_tuned_model, X_test, y_test
)
print("Test performance:")
xgb_test_over
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.929911 | 0.984615 | 0.700219 | 0.818414 |
# print confusion matrix on train set
plot_confusion_matrix(xgb_tuned_model, X_train_over, y_train_over)
# print confusion matrix on Validation set
plot_confusion_matrix(xgb_tuned_model, X_val, y_val)
# print confusion matrix on test set
plot_confusion_matrix(xgb_tuned_model, X_test, y_test)
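The Recall and Precision figures reported above follow directly from the confusion-matrix counts. A minimal sketch on toy labels (the label vectors below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 1 = attrited customer, 0 = existing customer
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Recall = TP / (TP + FN): share of true churners the model catches
# Precision = TP / (TP + FP): share of flagged customers who truly churn
assert recall_score(y_true, y_pred) == tp / (tp + fn)
assert precision_score(y_true, y_pred) == tp / (tp + fp)
print(tp, fp, fn, tn)  # 3 1 1 3
```

Prioritizing Recall means driving FN down, which is why the chosen model tolerates a higher FP count (lower Precision) on the validation and test matrices.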
Observations
feature_names = X_train.columns
importances = xgb_tuned_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The top 6 features for the tuned XGBoost model are:
Features such as Card Category, Education Level, and Months on book are less significant in predicting whether a customer will churn.
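The ranking behind these statements can be made explicit by wrapping the importances in a pandas Series and taking the top k. The importance values below are hypothetical; the notebook's version uses xgb_tuned_model.feature_importances_ together with X_train.columns:

```python
import pandas as pd

# Hypothetical importance scores keyed by feature name
importances = pd.Series(
    [0.35, 0.25, 0.15, 0.10, 0.08, 0.04, 0.02, 0.01],
    index=[
        "Total_Trans_Ct", "Total_Trans_Amt", "Total_Revolving_Bal",
        "Total_Relationship_Count", "Months_Inactive_12_mon",
        "Contacts_Count_12_mon", "Card_Category", "Education_Level",
    ],
)

# Top-6 features by importance; low-ranked ones (e.g. Card_Category,
# Education_Level here) contribute little to the churn prediction
top6 = importances.nlargest(6)
print(top6.index.tolist())
```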
Based on the analysis and the features identified by the machine learning model, here are some key insights and recommendations to help Thera Bank reduce customer attrition and improve its credit card services:
Key Business Insights
Education and Card Category:
Product Usage:
Activity Levels:
Customer Contacts:
Transaction Patterns:
Recommendations
Enhance Transaction Benefits:
Increase Product Engagement:
Address Activity-Related Dissatisfaction:
Optimize Customer Support:
By focusing on these areas, Thera Bank can improve customer satisfaction, reduce churn, and ultimately enhance the profitability of its credit card services.